Modelling Computational Resources for Next Generation Sequencing Bioinformatics Analysis of 16S rRNA Samples

Authors

  • Matthew J. Wade
  • Thomas P. Curtis
  • Russell J. Davenport
Abstract

In the rapidly evolving domain of next generation sequencing (NGS) and bioinformatics analysis, the volume of data generated is increasing at a comparable pace. The burden of processing large amounts of sequencing data has emphasised the need to allocate sufficient computing resources to complete analyses in the shortest possible time with manageable and predictable costs. A novel method for predicting time to completion for a popular bioinformatics software package (QIIME) was developed using key variables characteristic of the input data that were assumed to affect processing time. Multiple Linear Regression models were developed to determine run time for two denoising algorithms and a general bioinformatics pipeline. The models accurately predicted clock time for denoising sequences from a naturally assembled community dataset, but not from an artificial community. Speedup and efficiency tests for AmpliconNoise also showed that caution is needed when allocating resources for parallel processing of data. Accurate modelling of computational processing time using easily measurable predictors can help NGS analysts determine resource requirements for bioinformatics software and pipelines. Whilst demonstrated on a specific group of scripts, the methodology can be extended to other packages running on multiple architectures, either in parallel or sequentially.

Keywords: Computational performance, bioinformatics pipelines, Multiple Linear Regression modelling

arXiv:1503.02974v1 [q-bio.GN] 10 Mar 2015
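As a rough illustration of the approach the abstract describes, the sketch below fits a multiple linear regression of run time on input-data predictors and computes the standard parallel speedup (serial time divided by parallel time) and efficiency (speedup divided by processor count). It is not the authors' implementation: the predictor columns (number of sequences, mean read length), the function names, and the demo values are hypothetical, and the demo data is synthetic, generated from known coefficients purely to show that the fit recovers them.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares fit; returns [intercept, coef_1, ..., coef_k]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Predicted run time for each row of predictors in X."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency E = S / p, the fraction of ideal linear speedup achieved."""
    return speedup(t_serial, t_parallel) / n_procs

# Synthetic illustration only (hypothetical predictors, not data from the paper):
# column 0 = number of sequences, column 1 = mean read length.
X_demo = np.array(
    [[1000, 250], [2000, 260], [4000, 240], [8000, 255], [16000, 245]],
    dtype=float,
)
# Run times generated from assumed coefficients so the fit can be checked.
y_demo = 5.0 + 0.002 * X_demo[:, 0] + 0.01 * X_demo[:, 1]
beta_demo = fit_mlr(X_demo, y_demo)
```

With real measurements, `X_demo` and `y_demo` would be replaced by observed predictor values and recorded wall-clock times. An efficiency well below 1 at higher processor counts is the kind of signal that suggests additional cores are not being used effectively.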

Similar resources

Clustering 16S rRNA for OTU prediction: a method of unsupervised Bayesian clustering

MOTIVATION With the advancements of next-generation sequencing technology, it is now possible to study samples directly obtained from the environment. Particularly, 16S rRNA gene sequences have been frequently used to profile the diversity of organisms in a sample. However, such studies are still taxed to determine both the number of operational taxonomic units (OTUs) and their relative abundan...

The Isolation and Identification of Dominant Lactic Acid Bacteria by the Sequencing of the 16S rRNA in Traditional Cheese (Khiki) in Semnan, Iran

Background: Identification of the dominant lactic acid bacteria involved in the production of traditional cheese in Semnan could be the initiative to protect national genetic resources and produce industrial cheese with desirable texture and organoleptic characteristics similar to traditional cheeses. The present study aimed to determine the biochemical, physiological, and phenotypic properties...

A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis

BACKGROUND Advances in next-generation sequencing technologies have allowed for detailed, molecular-based studies of microbial communities such as the human gut, soil, and ocean waters. Sequencing of the 16S rRNA gene, specific to prokaryotes, using universal PCR primers has become a common approach to studying the composition of these microbiota. However, the bioinformatic processing of the re...

Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing

BACKGROUND The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck" resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. RESULTS We present...

16S rRNA Terminal Restriction Fragment Length Polymorphism for the Characterization of the Nasopharyngeal Microbiota

A novel non-culture based 16S rRNA Terminal Restriction Fragment Length Polymorphism (T-RFLP) method using the restriction enzymes Tsp509I and Hpy166II was developed for the characterization of the nasopharyngeal microbiota and validated using recently published 454 pyrosequencing data. 16S rRNA gene T-RFLP for 153 clinical nasopharyngeal samples from infants with acute otitis media (AOM) revea...

Journal title:
  • CoRR

Volume abs/1503.02974

Publication date 2015